Abstract: A common approach to protecting confidential information
is to use a stream cipher, which combines plaintext bits with a
pseudo-random bit sequence. Among the existing stream ciphers,
Non-Linear Feedback Shift Register (NLFSR)-based ones provide the
best trade-off between cryptographic security and hardware
efficiency. In this paper, we show how to further improve the
hardware efficiency of the Grain stream cipher. By transforming
the NLFSR of Grain from its original Fibonacci configuration to
the Galois configuration and by introducing a clock division
block, we double the throughput of the 80-bit and 128-bit key
1 bit/cycle architectures of Grain with no area penalty.

I. Introduction 

Applications in constrained environments, such as hardware
authentication devices (RFID, etc.), smartcards, and wireless
networks (Bluetooth, NFC, etc.), require power-efficient,
area-efficient and high-performance hardware encryption systems
with large security margins. To date, no adequate cryptographic
solution has been proposed which satisfies the extreme
limitations of devices like RFIDs [1]. Even the most compact
of today's encryption systems - Non-Linear Feedback Shift 
Register (NLFSR)-based stream ciphers - contain an order 
of magnitude more gates than can be dedicated for security 
functionality in the low-cost RFID tags [2]. The lack of 
adequate encryption mechanisms gives rise to many security 
and privacy problems and blocks off a variety of potential 
applications of RFIDs. 

Motivated by these needs, in 2004-2008 the EU ECRYPT 
network carried out the eSTREAM project with the objective
of identifying the best stream cipher designs [3]. Stream ciphers
Grain-80 [4], Trivium [5], and Mickey-v2 [6] were selected 
as finalists for the hardware-oriented profile. Grain-80 with 1 
bit/cycle throughput has the smallest hardware among all eS- 
TREAM candidates, which makes it a particularly interesting 
case. 

Grain-80 consists of one 80-bit LFSR [7], one 80-bit 
NLFSR [8], and a function combining selected state bits. The 
shift registers take almost 80 percent of the total area of the 
system and define its critical path. In this paper, we show 
that by transforming the shift registers of Grain from their 
original Fibonacci configuration to the Galois configuration, 
we can significantly improve its throughput. In the Fibonacci 
configuration of shift registers, the feedback is applied to 
the first bit of the register only. In the Galois configuration, 
the feedback can be applied to any bit. Thus, the depth of 
the circuits implementing feedback functions of the Galois 
configuration can potentially be smaller, leading to shorter 
propagation time and higher throughput. 



However, unlike the LFSR case in which the mapping from 
the Fibonacci configuration to the Galois configuration is one- 
to-one, in the NLFSR case multiple Galois NLFSRs can be 
equivalent to a given Fibonacci one [9]. The problem of 
selecting a "best" Galois NLFSR for a given Fibonacci one 
is still open. One of the contributions of this paper is finding
maximal-throughput Galois configurations of the NLFSRs of
Grain-80 and Grain-128 [10]. Another contribution is the
introduction of a clock division block which divides the clock
frequency of Grain by two or four during the initialization
phase. Without such a block, the potential benefits of the
Galois configuration cannot be utilized.

II. Background 

A. Definition of NLFSRs 

A Non-Linear Feedback Shift Register (NLFSR) consists
of $n$ binary storage elements, called bits. Each bit
$i \in \{0, 1, \ldots, n-1\}$ has an associated state variable $x_i$,
which represents the current value of the bit $i$, and a feedback
function $f_i : \{0,1\}^n \rightarrow \{0,1\}$, which determines how
the value of $i$ is updated.

A state of an NLFSR is an ordered set of values of its state
variables. At every clock cycle, the next state is determined
from the current state by updating the values of all bits
simultaneously to the values of the corresponding $f_i$'s. The
output of an NLFSR is the value of its 0th bit.

If for all $i \in \{0, 1, \ldots, n-2\}$ the feedback functions are
of type $f_i = x_{i+1}$, we call an NLFSR the Fibonacci type.
Otherwise, we call an NLFSR the Galois type.

Two NLFSRs are equivalent if their sets of output sequences 
are equal. 
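
The definitions above can be illustrated with a minimal Python sketch. The 4-bit register and its feedback function $f_3 = x_0 \oplus x_1 x_2$ are a toy example chosen for illustration, not a register taken from Grain, and the helper names are hypothetical.

```python
from itertools import product

def shift_bit(i):
    """Feedback function of type f_i = x_{i+1}."""
    return lambda x: x[i + 1]

# Toy 4-bit Fibonacci NLFSR (an assumption for illustration only):
# f_0 = x_1, f_1 = x_2, f_2 = x_3, f_3 = x_0 XOR x_1 x_2.
FIB = [shift_bit(0), shift_bit(1), shift_bit(2),
       lambda x: x[0] ^ (x[1] & x[2])]

def step(state, feedback):
    """Update all bits simultaneously: bit i takes the value f_i(state)."""
    return [f(state) & 1 for f in feedback]

def output_sequence(state, feedback, length):
    """Collect the value of bit 0 over `length` clock cycles."""
    out = []
    for _ in range(length):
        out.append(state[0])
        state = step(state, feedback)
    return out

def is_fibonacci(feedback):
    """Check f_i = x_{i+1} for all i in {0, ..., n-2} over every state."""
    n = len(feedback)
    return all(feedback[i](x) == x[i + 1]
               for x in product((0, 1), repeat=n)
               for i in range(n - 1))

print(output_sequence([1, 0, 0, 0], FIB, 8))
print(is_fibonacci(FIB))
```

The exhaustive check in `is_fibonacci` is only practical for small $n$; it is used here to keep the sketch close to the wording of the definition.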

B. The Transformation from the Fibonacci to the Galois 
Configuration 

Let $f_i$ and $f_j$ be feedback functions of bits $i$ and $j$ of an
$n$-bit NLFSR, respectively. The operation shifting, denoted by
$f_i \xrightarrow{P} f_j$, moves a set of product-terms $P$ from
$f_i$ to $f_j$. The index of each variable $x_k$ of each
product-term in $P$ is changed to $x_{(k-i+j) \bmod n}$.
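
As a rough sketch of the shifting operation, the following Python code represents each feedback function as a set of product-terms and each product-term as a set of variable indexes. This representation and the function name `shift` are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Set

Term = FrozenSet[int]  # a product-term, stored as the set of its variable indexes

def shift(functions: Dict[int, Set[Term]], i: int, j: int,
          terms: Set[Term], n: int) -> Dict[int, Set[Term]]:
    """Move the product-terms `terms` from f_i to f_j.

    Every variable index k in a moved term becomes (k - i + j) mod n.
    """
    if not terms <= functions.get(i, set()):
        raise ValueError("all shifted terms must belong to f_i")
    result = {bit: set(ts) for bit, ts in functions.items()}
    result[i] -= terms
    moved = {frozenset((k - i + j) % n for k in t) for t in terms}
    # XOR-sum semantics: a term that is already present in f_j cancels out.
    result[j] = result.get(j, set()) ^ moved
    return result

# Toy 4-bit example (hypothetical): f_3 = x_0 XOR x_1 x_2, f_2 = x_3.
fib = {3: {frozenset({0}), frozenset({1, 2})}, 2: {frozenset({3})}}
gal = shift(fib, 3, 2, {frozenset({1, 2})}, n=4)
print(gal)  # f_3 = x_0, f_2 = x_3 XOR x_0 x_1
```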

The terminal bit $\tau$ of an $n$-bit NLFSR is the bit with the
maximal index which satisfies the following condition: for all
bits $i$ such that $i < \tau$, $f_i$ is of type $f_i = x_{i+1}$.

Definition 1: An $n$-bit NLFSR is uniform if the following
two conditions hold:

(a) all its feedback functions are singular functions of type

$$f_i(x_0, \ldots, x_{n-1}) = x_{(i+1) \bmod n} \oplus g_i(x_0, \ldots, x_{n-1}),$$

where $g_i$ does not depend on $x_{(i+1) \bmod n}$,
(b) for all its bits $i$ such that $i > \tau$, the index of every
variable of $g_i$ is not larger than $\tau$.
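
The terminal bit and the two conditions of Definition 1 can be checked mechanically. The sketch below assumes that every feedback function already has the form $f_i = x_{(i+1) \bmod n} \oplus g_i$ and stores only the product-terms of each $g_i$; the data layout and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Terms = Dict[int, List[Tuple[int, ...]]]  # bit index -> product-terms of g_i

def terminal_bit(n: int, g: Terms) -> int:
    """Largest tau such that g_i is empty (f_i = x_{i+1}) for all i < tau."""
    nonempty = [i for i in range(n) if g.get(i)]
    return min(nonempty) if nonempty else n - 1

def is_uniform(n: int, g: Terms) -> bool:
    """Check conditions (a) and (b) of Definition 1."""
    tau = terminal_bit(n, g)
    for i in range(n):
        for term in g.get(i, ()):
            if (i + 1) % n in term:          # (a): g_i must not use x_{(i+1) mod n}
                return False
            if i > tau and max(term) > tau:  # (b): indexes bounded by tau above tau
                return False
    return True

# Toy 4-bit examples (assumptions for illustration only).
print(terminal_bit(4, {3: [(1, 2)]}), is_uniform(4, {3: [(1, 2)]}))
print(terminal_bit(4, {2: [(0, 1)]}), is_uniform(4, {2: [(0, 1)]}))
print(is_uniform(4, {2: [(0, 1)], 3: [(3,)]}))
```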

Theorem 1: [9] Given a uniform NLFSR with the terminal
bit $\tau$, a shifting $g_\tau \xrightarrow{P} g_{\tau'}$,
$\tau' < \tau$, results in an equivalent
NLFSR if the transformed NLFSR is uniform as well.
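
For small registers, the equivalence stated by Theorem 1 can be checked by brute force. The sketch below compares the sets of output sequences of a toy 4-bit Fibonacci NLFSR ($f_3 = x_0 \oplus x_1 x_2$) and of the Galois NLFSR obtained by shifting the product-term $x_1 x_2$ from bit 3 to bit 2, where it becomes $x_0 x_1$. The registers and helper names are illustrative assumptions; a finite simulation is a sanity check, not a proof.

```python
from itertools import product

def step(state, g):
    """One clock cycle of an NLFSR with f_i = x_{(i+1) mod n} XOR g_i."""
    n = len(state)
    nxt = []
    for i in range(n):
        bit = state[(i + 1) % n]
        for term in g.get(i, ()):
            bit ^= int(all(state[k] for k in term))
        nxt.append(bit)
    return tuple(nxt)

def output_set(n, g, length):
    """Set of output sequences (bit 0) over all 2^n initial states."""
    seqs = set()
    for init in product((0, 1), repeat=n):
        state, out = init, []
        for _ in range(length):
            out.append(state[0])
            state = step(state, g)
        seqs.add(tuple(out))
    return seqs

FIB = {3: [(1, 2)]}      # f_3 = x_0 XOR x_1 x_2
GAL = {2: [(0, 1)]}      # f_2 = x_3 XOR x_0 x_1, f_3 = x_0
GAL_BAD = {2: [(1, 2)]}  # term moved without changing its indexes

print(output_set(4, FIB, 16) == output_set(4, GAL, 16))
print(output_set(4, FIB, 16) == output_set(4, GAL_BAD, 16))
```

The third register shows why the index change matters: moving the term without re-indexing it gives a different set of output sequences.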

III. The Description of Grain 

There are two versions of Grain: one with an 80-bit key [4] and
one with a 128-bit key [10]. Both consist of an LFSR, an NLFSR,
and two combining functions.

In Grain-80, both shift registers are 80 bits long. They are both
of the Fibonacci type, i.e. every bit $i$ except the 79th takes the
value of bit $i+1$. The feedback function of the 79th bit of the
LFSR is given by:

$$f_{79} = s_{62} \oplus s_{51} \oplus s_{38} \oplus s_{23} \oplus s_{13} \oplus s_0$$

where $s_i$ is the state variable of the $i$th bit, $i \in \{0, 1, \ldots, 79\}$.

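The feedback function of the 79th bit of the NLFSR
is given by:

$$\begin{aligned}
g_{79} = {} & s_0 \oplus b_0 \oplus b_{62} \oplus b_{60} \oplus b_{52} \oplus b_{45} \oplus b_{37} \oplus b_{33} \oplus b_{28} \\
& \oplus b_{21} \oplus b_{14} \oplus b_9 \oplus b_{63} b_{60} \oplus b_{37} b_{33} \oplus b_{15} b_9 \oplus b_{60} b_{52} b_{45} \\
& \oplus b_{33} b_{28} b_{21} \oplus b_{63} b_{45} b_{28} b_9 \oplus b_{60} b_{52} b_{37} b_{33} \oplus b_{63} b_{60} b_{21} b_{15} \\
& \oplus b_{63} b_{60} b_{52} b_{45} b_{37} \oplus b_{33} b_{28} b_{21} b_{15} b_9 \oplus b_{52} b_{45} b_{37} b_{33} b_{28} b_{21}
\end{aligned}$$

where $b_i$ is the state variable of the $i$th bit, $i \in \{0, 1, \ldots, 79\}$.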

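The first combining function of Grain-80 produces its output
value based on selected bits from the NLFSR and the
LFSR:

$$\begin{aligned}
H = {} & s_{25} \oplus b_{63} \oplus s_3 s_{64} \oplus s_{46} s_{64} \oplus s_{64} b_{63} \oplus s_3 s_{25} s_{46} \\
& \oplus s_3 s_{46} s_{64} \oplus s_3 s_{46} b_{63} \oplus s_{25} s_{46} b_{63} \oplus s_{46} s_{64} b_{63}
\end{aligned}$$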

The second combining function of Grain-80 generates the
output stream of the system from selected bits of the
NLFSR and LFSR states and the output of H:

$$Z = \sum_{k \in A} b_k \oplus H,$$

where $A = \{1, 2, 4, 10, 31, 43, 56\}$.

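For Grain-128, the corresponding functions are:

$$\begin{aligned}
f_{127} = {} & s_0 \oplus s_7 \oplus s_{38} \oplus s_{70} \oplus s_{81} \oplus s_{96} \\
g_{127} = {} & s_0 \oplus b_0 \oplus b_{26} \oplus b_{56} \oplus b_{91} \oplus b_{96} \oplus b_3 b_{67} \oplus b_{11} b_{13} \\
& \oplus b_{17} b_{18} \oplus b_{27} b_{59} \oplus b_{40} b_{48} \oplus b_{61} b_{65} \oplus b_{68} b_{84} \\
H = {} & b_{12} s_8 \oplus s_{13} s_{20} \oplus b_{95} s_{42} \oplus s_{60} s_{79} \oplus b_{12} b_{95} s_{95} \\
Z = {} & \sum_{k \in A} b_k \oplus s_{93} \oplus H
\end{aligned}$$

where $A = \{2, 15, 36, 45, 64, 73, 89\}$.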

Before generating a stream of data, a cipher must be
initialized with default keys. During the initialization phase, the
cipher does not produce any output for 160 clock cycles for
Grain-80 and 256 cycles for Grain-128. The output of the Z
function is XOR-ed with the outputs of the LFSR and NLFSR
and then fed into the inputs of both shift registers, as shown
in Figure 1. After the initialization, the loops are opened and
there is no feedback between the two shift registers.
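
The Grain-80 functions and the initialization loops described above can be put together in a short behavioral sketch. It is a software model for illustration, not the hardware architecture discussed in this paper. Two points are assumptions: the loading of the key and IV into the registers is not described here, so the sketch takes the initial register contents as arguments, and the fourth LFSR tap of H is taken as $s_{64}$, following the Grain-80 specification [4].

```python
from typing import List, Tuple

A = (1, 2, 4, 10, 31, 43, 56)  # NLFSR taps of the output function Z

def lfsr_feedback(s: List[int]) -> int:
    return s[62] ^ s[51] ^ s[38] ^ s[23] ^ s[13] ^ s[0]

def nlfsr_feedback(s: List[int], b: List[int]) -> int:
    return (s[0] ^ b[0] ^ b[62] ^ b[60] ^ b[52] ^ b[45] ^ b[37] ^ b[33]
            ^ b[28] ^ b[21] ^ b[14] ^ b[9]
            ^ (b[63] & b[60]) ^ (b[37] & b[33]) ^ (b[15] & b[9])
            ^ (b[60] & b[52] & b[45]) ^ (b[33] & b[28] & b[21])
            ^ (b[63] & b[45] & b[28] & b[9])
            ^ (b[60] & b[52] & b[37] & b[33])
            ^ (b[63] & b[60] & b[21] & b[15])
            ^ (b[63] & b[60] & b[52] & b[45] & b[37])
            ^ (b[33] & b[28] & b[21] & b[15] & b[9])
            ^ (b[52] & b[45] & b[37] & b[33] & b[28] & b[21]))

def h(s: List[int], b: List[int]) -> int:
    # Assumption: the fourth LFSR tap is s[64], as in the Grain-80 specification [4].
    return (s[25] ^ b[63] ^ (s[3] & s[64]) ^ (s[46] & s[64]) ^ (s[64] & b[63])
            ^ (s[3] & s[25] & s[46]) ^ (s[3] & s[46] & s[64])
            ^ (s[3] & s[46] & b[63]) ^ (s[25] & s[46] & b[63])
            ^ (s[46] & s[64] & b[63]))

def z(s: List[int], b: List[int]) -> int:
    out = h(s, b)
    for k in A:
        out ^= b[k]
    return out

def clock(s: List[int], b: List[int],
          initializing: bool) -> Tuple[List[int], List[int], int]:
    """One clock cycle; during initialization Z is fed back into both registers."""
    out = z(s, b)
    new_s = lfsr_feedback(s)
    new_b = nlfsr_feedback(s, b)
    if initializing:
        new_s ^= out
        new_b ^= out
    return s[1:] + [new_s], b[1:] + [new_b], out

def keystream(s: List[int], b: List[int], nbits: int) -> List[int]:
    """Run 160 initialization cycles (no output), then produce nbits of keystream."""
    for _ in range(160):
        s, b, _unused = clock(s, b, True)
    bits = []
    for _ in range(nbits):
        s, b, out = clock(s, b, False)
        bits.append(out)
    return bits

print(keystream([1] * 80, [0, 1] * 40, 16))
```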



It is possible to increase the throughput of Grain at the
expense of extra hardware by introducing parallelism in its
architecture. In parallelized versions of Grain, in each clock
cycle blocks of duplicated NLFSR and LFSR feedback functions
produce output bits in parallel. To allow for up to 16
(32) degrees of parallelization, Grain-80 (128) is designed so
that the bits $65 \le i \le 79$ ($97 \le i \le 127$) of the shift
registers are not used in the feedback functions or in the input
to the combining functions.

IV. Grain with Galois Configuration 

Grain can be modified by transforming its LFSR and
NLFSR from their original Fibonacci configurations to the
Galois configurations. The transformation of LFSRs is done
using standard techniques, so in this section we only describe
the transformation of NLFSRs.

The NLFSR of Grain-80 (128) can be transformed to the
Galois configuration by shifting the product-terms of the
feedback function of the 79th (127th) bit to the feedback
functions of bits with lower indexes. By Theorem 1, if the NLFSR
after shifting satisfies the conditions of Definition 1, then it
produces the same sets of output sequences as the NLFSR
before shifting.

Ideally, in order to maximize the throughput, we want
to distribute the product-terms equally among the feedback
functions. However, according to [9], to guarantee equivalence of
NLFSRs before and after shifting, we cannot shift to bits with
indexes lower than the bit $\tau$ which is given by:

$$\tau = \max_{\forall p \in P} \left( \mathit{max\_index}(p) - \mathit{min\_index}(p) \right),$$

where $P$ is the set of all product-terms of the feedback function
of the Fibonacci NLFSR, and $\mathit{min\_index}(p)$ ($\mathit{max\_index}(p)$)
denotes the minimal (maximal) index of the variables of the
product-term $p$.

For Grain-80, the product-term with the maximal difference
in indexes of variables is $b_{63} b_{45} b_{28} b_9$, so $\tau = 54$.
For Grain-128, we have $\tau = 64$ due to the product-term $b_3 b_{67}$.
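
The bound $\tau$ can be recomputed directly from the product-terms of the two Fibonacci feedback functions. The sketch below lists the terms of $g_{79}$ and $g_{127}$ as tuples of variable indexes (the terms $s_0$ and $b_0$ are left out because they do not affect the maximum); the names are illustrative.

```python
GRAIN80_TERMS = [
    (62,), (60,), (52,), (45,), (37,), (33,), (28,), (21,), (14,), (9,),
    (63, 60), (37, 33), (15, 9), (60, 52, 45), (33, 28, 21),
    (63, 45, 28, 9), (60, 52, 37, 33), (63, 60, 21, 15),
    (63, 60, 52, 45, 37), (33, 28, 21, 15, 9), (52, 45, 37, 33, 28, 21),
]

GRAIN128_TERMS = [
    (26,), (56,), (91,), (96,),
    (3, 67), (11, 13), (17, 18), (27, 59), (40, 48), (61, 65), (68, 84),
]

def tau(terms):
    """Maximum over all product-terms of (max index - min index)."""
    return max(max(t) - min(t) for t in terms)

def widest_term(terms):
    """Product-term that determines tau."""
    return max(terms, key=lambda t: max(t) - min(t))

print(tau(GRAIN80_TERMS), widest_term(GRAIN80_TERMS))
print(tau(GRAIN128_TERMS), widest_term(GRAIN128_TERMS))
```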

However, in order to avoid modifications of the encrypting
algorithm of Grain, we need to guarantee not only the
equivalence of the sequences of output bits, but also the
equivalence of the sequences of all internal bits of the
NLFSR used by the combining functions. A modification of
the encrypting algorithm could lead to undesirable changes in
the security of Grain. For Grain-80, the bit 63 of the NLFSR
is used in the function H, and bits 1, 2, 4, 10, 31, 43, 56 are
used in the function Z. Since 56 and 63 are greater than
54, we cannot use $\tau = 54$ as the terminal bit of the Galois
configuration. We need to set the terminal bit to 63. Then,
for all bits $i \in \{0, 1, \ldots, 62\}$, the feedback functions will
be of type $g_i = b_{i+1}$, and the output sequences of the bits
$i \in \{1, 2, \ldots, 63\}$ will be the same as the output sequence of
the bit 0 shifted in time. Consequently, the algorithm of Grain
will not change.

For Grain-128, the bits 12 and 95 of the NLFSR are 
used in H and the bits 2,15,36,45,64,73,89 are used in Z. 
Therefore, the terminal bit has to be 95. 
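
One way to read the rule used above is that the terminal bit must be at least as large as $\tau$ and as every NLFSR bit read by the combining functions. A minimal sketch under that reading (the function name is hypothetical):

```python
def terminal_bit(tau, h_taps, z_taps):
    """Smallest terminal bit that is not below tau or any NLFSR tap of H and Z."""
    return max([tau, *h_taps, *z_taps])

# Grain-80: tau = 54, H reads NLFSR bit 63, Z reads the bits in A.
print(terminal_bit(54, (63,), (1, 2, 4, 10, 31, 43, 56)))
# Grain-128: tau = 64, H reads NLFSR bits 12 and 95, Z reads the bits in A.
print(terminal_bit(64, (12, 95), (2, 15, 36, 45, 64, 73, 89)))
```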



After we have chosen the position of the terminal bit, we
can start shifting product-terms from the function $g_{79}$
($g_{127}$) to the functions with indexes greater than or equal to
the terminal bit. Shifting can be done in many different ways.
At present there is no systematic technique which guarantees
that the transformation produces an NLFSR with the maximal
throughput for a given technology. We found the solutions
presented below by trying many different choices.

A. One Bit per Cycle Version 

According to our simulation results, the following Galois
NLFSR results in the maximal throughput for the 1 bit/cycle
version of Grain-80:

Here and further in this section, all omitted feedback functions
are of type $g_i = b_{i+1}$.

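For Grain-128, the maximal-throughput Galois NLFSR is:

$$\begin{aligned}
g_{127} &= s_0 \oplus b_0 \oplus b_3 b_{67} \\
g_{124} &= b_{125} \oplus b_0 b_{64} \\
g_{116} &= b_{117} \oplus b_0 b_2 \\
g_{110} &= b_{111} \oplus b_0 b_1 \\
g_{102} &= b_{103} \oplus b_{71} \\
g_{101} &= b_{102} \oplus b_0 \\
g_{100} &= b_{101} \oplus b_0 b_{32} \\
g_{99} &= b_{100} \oplus b_{63} \\
g_{98} &= b_{99} \oplus b_{27} \\
g_{97} &= b_{98} \oplus b_{38} b_{54} \\
g_{96} &= b_{97} \oplus b_{30} b_{34} \\
g_{95} &= b_{96} \oplus b_8 b_{16}
\end{aligned}$$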

B. Multiple Bit per Cycle Version 

We can extend the theory presented in [9], [11] to $k$
bits/cycle versions of Grain by restricting the bit positions to
which the feedback can be applied. It is easy to see that, to
ensure a degree of parallelization $k$ for an $n$-bit Galois
NLFSR with the terminal bit $\tau$, all bits except

$$n - 1 - i \cdot k, \quad \text{for all } i \in \{0, 1, \ldots, \lfloor (n-1-\tau)/k \rfloor - 1\},$$

should have feedback functions of type $g_j = b_{j+1}$.
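
The positions allowed by this rule can be enumerated with a few lines of Python. The sketch uses the terminal bits chosen earlier in this section (63 for Grain-80 and 95 for Grain-128); the function name is hypothetical.

```python
def feedback_positions(n, tau, k):
    """Bits n-1-i*k, i in {0, ..., floor((n-1-tau)/k) - 1}, that may carry feedback."""
    count = (n - 1 - tau) // k
    return [n - 1 - i * k for i in range(count)]

for k in (4, 8, 16):
    print("Grain-80 ", k, feedback_positions(80, 63, k))
for k in (4, 8, 16, 32):
    print("Grain-128", k, feedback_positions(128, 95, k))
```

For the highest degrees of parallelization (16 for Grain-80 and 32 for Grain-128) only the last bit remains, which matches the observation that no transformation can be done in those cases.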



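So, for example, for the 4 bits/cycle version of Grain-80, we
can apply feedback to the bits 79, 75, 71 and 67:

$$\begin{aligned}
g_{79} = {} & s_0 \oplus b_0 \oplus b_{62} \oplus b_{33} \oplus b_{28} \oplus b_{21} \oplus b_{15} b_9 \\
& \oplus b_{52} b_{45} b_{37} b_{33} b_{28} b_{21} \\
g_{75} = {} & b_{76} \oplus b_{41} \oplus b_{33} \oplus b_5 \oplus b_{59} b_{56} \oplus b_{33} b_{29} \oplus b_{59} b_{41} b_{24} b_5 \\
g_{71} = {} & b_{72} \oplus b_{44} \oplus b_{25} b_{20} b_{13} \oplus b_{55} b_{52} b_{13} b_7 \oplus b_{25} b_{20} b_{13} b_7 b_1 \\
g_{67} = {} & b_{68} \oplus b_{48} \oplus b_2 \oplus b_{48} b_{40} b_{33} \oplus b_{48} b_{40} b_{25} b_{21} \\
& \oplus b_{51} b_{48} b_{40} b_{33} b_{25}
\end{aligned}$$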
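
As a consistency check, the product-terms of the 4 bits/cycle functions can be shifted back to bit 79 with the index rule $(k - i + j) \bmod n$ and compared with the product-terms of the Fibonacci function $g_{79}$. The sketch below assumes the product-terms read as in the equations above and omits $s_0$ and the shift terms $b_{i+1}$; the names are illustrative.

```python
from collections import Counter

N = 80

FIBONACCI_G79 = [
    (62,), (60,), (52,), (45,), (37,), (33,), (28,), (21,), (14,), (9,),
    (63, 60), (37, 33), (15, 9), (60, 52, 45), (33, 28, 21),
    (63, 45, 28, 9), (60, 52, 37, 33), (63, 60, 21, 15),
    (63, 60, 52, 45, 37), (33, 28, 21, 15, 9), (52, 45, 37, 33, 28, 21),
]

GALOIS_4BIT = {
    79: [(62,), (33,), (28,), (21,), (15, 9), (52, 45, 37, 33, 28, 21)],
    75: [(41,), (33,), (5,), (59, 56), (33, 29), (59, 41, 24, 5)],
    71: [(44,), (25, 20, 13), (55, 52, 13, 7), (25, 20, 13, 7, 1)],
    67: [(48,), (2,), (48, 40, 33), (48, 40, 25, 21), (51, 48, 40, 33, 25)],
}

def shift_term(term, src, dst, n=N):
    """Re-index a product-term moved from bit src to bit dst."""
    return tuple(sorted((k - src + dst) % n for k in term))

def shifted_back(galois, n=N):
    """Multiset of all Galois product-terms after moving them back to bit n-1."""
    terms = Counter()
    for bit, bit_terms in galois.items():
        for term in bit_terms:
            terms[shift_term(term, bit, n - 1, n)] += 1
    return terms

original = Counter(tuple(sorted(t)) for t in FIBONACCI_G79)
print(shifted_back(GALOIS_4BIT) == original)
```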

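For the 8 bits/cycle version of Grain-80, we can apply feedback to
the bits 79 and 71:

$$\begin{aligned}
g_{79} = {} & s_0 \oplus b_0 \oplus b_{14} \oplus b_9 \oplus b_{15} b_9 \oplus b_{60} b_{52} b_{45} \oplus b_{33} b_{28} b_{21} \\
& \oplus b_{60} \oplus b_{60} b_{52} b_{37} b_{33} \oplus b_{63} b_{60} b_{21} b_{15} \oplus b_{33} b_{28} b_{21} b_{15} b_9 \\
g_{71} = {} & b_{72} \oplus b_{44} \oplus b_{37} \oplus b_{29} \oplus b_{25} \oplus b_{20} \oplus b_{13} \oplus b_{55} b_{52} \\
& \oplus b_{54} \oplus b_{29} b_{25} \oplus b_{55} b_{37} b_{20} b_1 \oplus b_{55} b_{52} b_{44} b_{37} b_{29} \\
& \oplus b_{44} b_{37} b_{29} b_{25} b_{20} b_{13}
\end{aligned}$$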

For 16 bit/cycle version of Grain-80, we can apply feedback 
only to the bit 79, i.e. no transformations can be done. 

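For the 4 bits/cycle version of Grain-128, we can apply feedback
to the bits 127, 123, 119, 115, 111, 107, 103 and 99:

$$\begin{aligned}
g_{127} &= s_0 \oplus b_0 \oplus b_3 b_{67} \\
g_{123} &= b_{124} \oplus b_{64} b_{80} \\
g_{119} &= b_{120} \oplus b_3 b_5 \\
g_{115} &= b_{116} \oplus b_{49} b_{53} \\
g_{111} &= b_{112} \oplus b_1 b_2 \\
g_{107} &= b_{108} \oplus b_6 \oplus b_{76} \\
g_{103} &= b_{104} \oplus b_{67} \oplus b_3 b_{35} \\
g_{99} &= b_{100} \oplus b_{28} \oplus b_{12} b_{20}
\end{aligned}$$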

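For the 8 bits/cycle version of Grain-128, we can apply feedback
to the bits 127, 119, 111 and 103:

$$\begin{aligned}
g_{127} &= s_0 \oplus b_0 \oplus b_{56} \oplus b_3 b_{67} \\
g_{119} &= b_{120} \oplus b_{18} \oplus b_{88} \oplus b_3 b_5 \\
g_{111} &= b_{112} \oplus b_{75} \oplus b_1 b_2 \oplus b_{52} b_{68} \\
g_{103} &= b_{104} \oplus b_3 b_{35} \oplus b_{16} b_{24} \oplus b_{37} b_{41}
\end{aligned}$$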

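For the 16 bits/cycle version of Grain-128, we can apply feedback
to the bits 127 and 111:

$$\begin{aligned}
g_{127} = {} & s_0 \oplus b_0 \oplus b_{56} \oplus b_3 b_{67} \oplus b_{11} b_{13} \oplus b_{40} b_{48} \\
g_{111} = {} & b_{112} \oplus b_{10} \oplus b_{75} \oplus b_{80} \oplus b_1 b_2 \oplus b_{11} b_{43} \oplus b_{45} b_{49} \\
& \oplus b_{52} b_{68}
\end{aligned}$$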

For the 32 bits/cycle version of Grain-128, we can apply feedback
only to the bit 127, i.e. no transformations can be done.

C. Design Details 

By transforming Grain's shift registers to the Galois
configuration as described in the previous section, we can obtain
up to 40% reduction in their propagation time. However, the
clock frequency of the overall Grain system improves by only
about 10%. The problem lies in the hardware architecture of
Grain during key initialization, when the output value
of Z is fed back to the LFSR and NLFSR, making two
loops, as shown in Figure 1. After the transformation from the
Fibonacci to the Galois configuration, due to the reduction of
the critical path in the NLFSR, the critical path of the system
is no longer in the NLFSR but in the initialization loops,
which are closed only during initialization. Thus, the highest
frequency that the system supports during initialization is
lower than the highest frequency supported during key stream
generation (see Table I).
Fig. 1. Initialization loops of Grain

TABLE I

Clock frequencies of Galois Grain-80

Fig. 2. Clock divider by four

To obtain a higher improvement in the throughput of Grain,
we introduce a clock division block to divide the frequency of
the clock during the initialization phase. The clock divider is
realized as a simple block containing one or two D flip-flops
which divides the clock frequency of the system by 2 or 4.
Figure 2 shows the structure of the clock division block
for division of the clock frequency by 4. In some versions of
Grain, division by 2 is sufficient to ensure correct operation
during the initialization phase. Division by 3 would be suitable
in some cases, but it would overcomplicate the hardware for
only a modest speedup of the initialization phase. The clock
division block is a very small component. Clock division by
four gives an area overhead of 25.67 GE and negligible power
consumption.
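
The behavior of a divide-by-four block built from two D flip-flops can be sketched in software. The wiring modeled below (each flip-flop fed with its own inverted output, the second one triggered by the falling edge of the first) is a common ripple-divider arrangement and is an assumption; the exact structure used in this design is the one given in Figure 2.

```python
def divide_by_four(num_cycles):
    """Return the divided clock level after each rising edge of the input clock."""
    q0 = 0  # first flip-flop: toggles on every rising edge of the input clock
    q1 = 0  # second flip-flop: toggles when q0 falls from 1 to 0
    levels = []
    for _ in range(num_cycles):
        previous_q0 = q0
        q0 ^= 1
        if previous_q0 == 1 and q0 == 0:
            q1 ^= 1
        levels.append(q1)
    return levels

def rising_edges(levels, initial=0):
    """Count 0 -> 1 transitions of a sampled signal."""
    edges, previous = 0, initial
    for level in levels:
        edges += int(previous == 0 and level == 1)
        previous = level
    return edges

print(divide_by_four(8))
print(rising_edges(divide_by_four(16)))
```

In this model the divided clock completes one period for every four input clock cycles; using only the first flip-flop would give division by two.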
Fig. 3. Grain with Galois configuration and clock divider by four

Grain always moves from the slower to the faster clock
frequency, and the run signal is set internally by the counter
on the positive edge of the clock. Because of the delay in the
production of the run signal, the first clock cycle of the key
generation phase will be shortened, which could potentially
lead to critical path violations in a performance-optimized
design such as Grain. We can handle this problem by placing
a flip-flop in front of the run signal which is output by the
counter. In this case, if the run signal rises to 1 after a positive
edge of the faster clock signal, the clock of the system changes
to the faster clock at the next positive edge. This
solution is shown in Figure 4.
Fig. 4. A: delay between positive edges of two clocks. B: delay between run
signal and high frequency clock. C: shortened clock



V. Experimental Results 

We have synthesized the Fibonacci and the Galois versions
of Grain using Cadence RTL Compiler in the TSMC 90 nm
standard cell technology library. Since the synthesis tool does
not handle multiple clocking, we set the two initialization
loops as false paths and optimized the designs for the
key-generation phase.

Table II shows the results for throughput, power consumption,
area, and frequency. Area is measured in terms of
NAND2 Gate Equivalents (GE). The total power consumption
of the system is estimated as a combination of dynamic and
leakage power for operation at 25 °C, with a power supply of
1.2 V at 10 MHz clock frequency as in [2].

As we can see, the throughput for Galois 1 bit/cycle Grain-80
and Grain-128 is more than doubled compared to Fibonacci.

Trivium is the highest ranked finalist in the eSTREAM
project. In Table III we compare the frequency and area
of Trivium (T) and Grain-80 with Galois configuration (G).
Both ciphers were implemented in TSMC 90 nm technology.
Due to the Galois configuration, Grain-80 (1 bit/cycle) is faster
and smaller than Trivium (1 bit/cycle), with a significantly
better throughput/area ratio. This is an important result for
applications such as RFID systems which require efficiency
in both throughput and area. The throughput/area figures
are compared graphically in Figure 5, where the figures for
the Fibonacci configuration (Grain(F)) of Grain-80 are also
reported.

TABLE III

Comparison between Trivium and Grain-80

Block    Freq (GHz)      Area (GE)
Size     T      G        T       G
1        3.8    4        2810    1772
4        3.4    2.7      2955    2471
8        3.6    2.3      3763    3575
16       3.6    1.7      3841    5768


TABLE II

Synthesis results for Galois and Fibonacci configurations of Grain in TSMC 90 nm technology

Columns: Cipher; Block Size; Frequency (GHz); Area (GE); Power (mW); Throughput (Gbit/s), with Galois, Fib., and impr. sub-columns.
Block size 1: frequency impr. 110%; area 1772 GE (Galois), 1743 GE (Fib.), impr. 0%.

Fig. 5. The throughput/area ratios of Trivium and Grain

VI. Conclusion

In this paper, we presented an improved version of the
Grain stream cipher. We found new implementations for its
NLFSRs which generate the same cryptographically strong
pseudo-random bit sequences as the ones of the original Grain,
but have a better hardware efficiency. The presented technique
is general and can be applied to any NLFSR-based stream
cipher. Its efficiency depends on the feedback function of the
NLFSR and the desired degree of parallelization. For Trivium,
the presented technique brings no improvement.